Skip to content

Conversation

ForVic
Copy link

@ForVic ForVic commented Aug 6, 2025

What changes were proposed in this pull request?

Add a config spark.driver.metrics.pollingInterval, and schedule driver polling interval / heartbeat at that schedule.

Why are the changes needed?

Decouple driver and executor heartbeat intervals. Due to sampling frequencies in memory metric reporting intervals we do not have a 100% accurate view of stats at drivers and executors. This is particularly observed at the driver, where we don't have the benefit of a larger sample size of metrics from N executors in application.

Here we can provide a way increase (or change in general) the rate of collection of metrics at the driver, to aid in overcoming the sampling problem, without requiring users to also increase executor heartbeat frequencies.

Does this PR introduce any user-facing change?

Yes, introduces a spark config

How was this patch tested?

Verified that metric collection was improved when sampling rates were increased, and verified that the number of events were expected when rate was changed.

Methodology for validating that increased driver heartbeat intervals would improve memory collection:

  1. Using a 6gb driver heap, wrote a job to broadcast a table, gradually increasing the size of the table until OOM.
  2. Increased driver memory to 10gb, large enough for the same broadcast to succeed.
  3. Repeated this job and tracked the peak memory usage that was written to event log.
  4. After repeated experiments, witnessed that the median peak heap typical usage was tracked at <=5GiB.
  5. Added my change, and decreased the heartbeat interval.
  6. Re-ran same jobs with 10gb heap, and saw that the typical peak memory usage tracked was ~8GiB, more accurately reflecting the increased memory needs.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Aug 6, 2025
@ForVic ForVic marked this pull request as ready for review August 6, 2025 21:40
@ForVic
Copy link
Author

ForVic commented Aug 6, 2025

cc @mridulm @robreeves @shardulm94

Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Scratch that - added a comment below

Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+CC @dongjoon-hyun for thoughts as well.

@@ -359,6 +359,11 @@ package object config {
.timeConf(TimeUnit.MILLISECONDS)
.createWithDefaultString("10s")

private[spark] val DRIVER_HEARTBEAT_INTERVAL =
ConfigBuilder("spark.driver.heartbeatInterval")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this is mostly just for reporting metrics, I am more inclined to rename this from heartbeat to something which more clearly conveys the intent.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about spark.driver.metricsReportingInterval

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for @mridulm 's comment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see we already differentiate, spark.executor.metrics.pollingInterval, I will call this spark.driver.metrics.pollingInterval ?

@mridulm mridulm self-requested a review August 24, 2025 20:36
@dongjoon-hyun
Copy link
Member

Thank you for pinging me, @mridulm . I'm catching up the community reviews this week.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, the background of proposal sounds reasonable to me. I'll revisit later when the PR has the final code.

@ForVic ForVic changed the title [SPARK-53157][CORE] Decouple driver and executor heartbeat polling intervals [SPARK-53157][CORE] Decouple driver and executor polling intervals Aug 26, 2025
private[spark] val DRIVER_METRICS_POLLING_INTERVAL =
ConfigBuilder("spark.driver.metrics.pollingInterval")
.version("4.1.0")
.fallbackConf(EXECUTOR_HEARTBEAT_INTERVAL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add more fields like spark.executor.metrics.pollingInterval, e.g., .doc and .timeConf? In addition, it would be great if we can place this new configuration to somewhere near the driver configs live instead of here.

private[spark] val EXECUTOR_METRICS_POLLING_INTERVAL =
ConfigBuilder("spark.executor.metrics.pollingInterval")
.doc("How often to collect executor metrics (in milliseconds). " +
"If 0, the polling is done on executor heartbeats. " +
"If positive, the polling is done at this interval.")
.version("3.0.0")
.timeConf(TimeUnit.MILLISECONDS)
.createWithDefaultString("0")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added doc string, but we can't add both .timeConf and .fallbackConf and if user has non-default EXECUTOR_HEARTBEAT_INTERVAL, we should fallback to matching behavior, thoughts?

private[spark] val DRIVER_METRICS_POLLING_INTERVAL =
ConfigBuilder("spark.driver.metrics.pollingInterval")
.doc("How often to collect driver metrics (in milliseconds). " +
"If 0, the polling is done at the executor heartbeat interval. " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? Is this correct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah typo - should be If unset

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To @mridulm , sorry but I'm not sure this PR is going into the right direction. I'll leave this to you for the rest of reviews because it seems that LinkedIn requires this feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants